Epilepsy causes repeated seizures throughout an individual's life, producing transient
irregularities in the brain's electrical activity and a range of abnormal physical symptoms.
Various antiepileptic drugs fail to minimize repeated seizures in patients. Electroencephalogram
(EEG) recordings provide time-series data for epileptic seizure detection and analysis. These
signals are highly nonlinear and inconsistent, and they are recorded over time. Predicting the
ictal period (the seizure period in epilepsy) by visual inspection is therefore a challenging
task for medical practitioners. Various machine learning techniques have been applied to identify
seizure occurrence and to classify seizures in multiple domains. A classification model based on
extreme gradient boosting (SCLXGB) is proposed here for the classification of EEG signals. The
SCLXGB model implements binary seizure classification on the benchmark dataset. Compared with
K-nearest neighbor, linear regression, and decision tree-based models, the proposed model achieves
the best area under the receiver operating characteristic curve (AUC) of 0.9462 and an accuracy of
96%, which signifies accurate prediction of seizure and non-seizure periods. The proposed SCLXGB
model was validated with different performance metrics to indicate the occurrence and
non-occurrence of seizures in patients more appropriately.

Keywords: Extreme gradient boosting, Seizure classification, XGBoost

1. INTRODUCTION

Various recording systems are now available to record human brain signals in multiple
forms. These systems have their own advantages and disadvantages [1]. For example, an
electroencephalogram (EEG) is a recording method that lends itself to processing steps such as
enhancement, feature extraction, and classification for epileptic seizure prediction. EEG signals
also offer higher temporal resolution and safety than other methods.

EEG recording is done through standardized electrodes placed non-invasively on the scalp.
Clinical inspection for seizure occurrence relies on visualizing the cerebrum's electrical
activity, which carries much information about brain function [2]. EEG signals are therefore of
great significance in brain diagnosis and in early epileptic seizure detection [3]. EEG waveforms
range from normal activity to spikes, sharp waves, spike-and-slow-wave complexes,
sharp-and-slow-wave complexes, and other epileptiform signals [4]. Thus, EEG recordings allow
researchers to perform automated epilepsy diagnosis more efficiently and in less time than visual
inspection of seizures.

Long EEG recordings capture the seizure period more reliably than short ones. Long-term EEG
signals can therefore clearly separate the interictal and ictal periods. The EEG signal is a
long-term recording of the brain signal, segregated into five frequency bands to properly
analyze seizure and non-seizure patterns [5]. This non-uniform, non-stationary signal is
time-dependent and is characterized by repetitive high-amplitude activity and combinations of
spike and slow waves [6].

EEG signals are highly non-stationary. They are analyzed with frequency, time, and
time-frequency decomposition methods, notably the discrete wavelet transform for feature
extraction [7]. Visually identifying seizures and the exact duration of ictal episodes is
time-consuming and tedious for physicians. Researchers have therefore implemented different
automated seizure prediction algorithms to detect epileptic seizures more precisely from
electroencephalogram recordings. Li et al. [8] applied soft computing techniques such as genetic
algorithms (GA) and fuzzy logic to classify epileptic and non-epileptic EEG segments. Because EEG
recordings are long, reducing their size while retaining the optimum information about ictal
activity has been challenging for researchers. This problem led to feature selection and
extraction using the complex dual-tree transform with variations in granularity level [9]. In
another study, features were extracted from the recorded signals with wavelet decomposition and
classified with Naive Bayes and K-nearest neighbor (KNN) classifiers across fourteen combinations
of two-class epilepsy problems [10]. The amount of information flowing into and out of the
brain's cortical regions during the ictal and preictal states has also been used as a parameter
for seizure classification [11]. Other studies performed feature extraction and feature selection
through correlation and random forest (RF) for seizure and non-seizure classification [12], [13].
EEG signals are also helpful for detecting other brain disorders such as autism, as illustrated
in [14], [15].

A machine learning (ML) approach for predicting the onset of seizures is proposed in the
present work. The extreme gradient boosting algorithm is implemented to classify EEG samples with
a trained predictive model. SCLXGB is a seizure classification method using the XGBoost model.
The method can take advantage of parallel and distributed computation. The XGBoost library [16]
is an efficient, distributed implementation of gradient boosting [17]. It offers faster model
execution with less overfitting than other boosting algorithms [18]. The significant
contributions of the paper are given below:

— An efficient classification model is designed for epileptic seizure identification from EEG
signals.
— Extreme gradient boosting, a variant of gradient boosting with efficient computational
flexibility, is used because it can handle missing values.
— The existing extreme gradient boosting model is modified to estimate the early stopping epoch
for the training data by computing the root mean square error (RMSE) for each epoch from the log
loss graph. The parameters are optimized accordingly.
— The proposed model is compared with existing classifiers and found to be more accurate, with
less computation time, for predicting seizures.


2. MATERIALS AND METHOD 

The epileptic seizure dataset is taken from the University of California at Irvine (UCI) machine
learning repository [19]. It consists of five sets, one per patient state, each with 100 files.
Each file holds a single subject's brain activity recorded for 23.6 seconds. The data is a
time-series signal sampled into 4097 data points. Each set of 4097 data points is divided into 23
pieces, and the pieces are shuffled [19]. Every piece thus contains 178 data points (4097/23,
rounded down) covering 1 second, and each data point is the EEG value at a separate time instant.
This gives a total of 23 x 500 = 11500 pieces of information, each a 178-dimensional input vector
X1, X2, ..., X178 covering 1 second. There are five patient states, represented by
y in {1, 2, 3, 4, 5}, as indicated in [19]:
— EEG recording of instances of seizures (ES).
— EEG signal from the brain tumour site (ETB).
— EEG recording from a healthy area of the brain (EHB).
— EEG signal of the healthy subject with eyes closed (EYEC).
— EEG signal of the healthy subject with eyes open (EYEO).

The subjects in categories 2, 3, 4, and 5 have no epilepsy, and the subject in category 1 has
epilepsy. The data is processed to simplify binary classification of category 1 against the rest
of the classes. Figure 1 represents the five states of the subjects.

[Figure 1 panel labels: Different states of the patients; No. of information of each patient; No. of data points of each patient]


Figure 1. An illustration of EEG recordings 
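
The chunking arithmetic and the one-vs-rest labelling described in this section can be sketched as follows. The counts come from the text; the helper names are hypothetical, and treating the three leftover samples per recording as discarded is an assumption, since the text does not say what happens to them.

```python
# Sketch of the dataset arithmetic and binary relabelling (helper names are hypothetical).

POINTS_PER_RECORDING = 4097   # samples in one 23.6 s recording (from the text)
CHUNKS_PER_RECORDING = 23     # pieces each recording is divided into
RECORDINGS = 5 * 100          # five patient states, 100 files each


def chunk_length(points: int, chunks: int) -> int:
    """Whole samples per chunk; any remainder is assumed to be dropped."""
    return points // chunks


def to_binary(label: int) -> int:
    """Map class 1 (seizure) to 1 and classes 2-5 (no seizure) to 0."""
    if label not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected class label: {label}")
    return 1 if label == 1 else 0


samples_per_chunk = chunk_length(POINTS_PER_RECORDING, CHUNKS_PER_RECORDING)
leftover = POINTS_PER_RECORDING - samples_per_chunk * CHUNKS_PER_RECORDING
total_rows = CHUNKS_PER_RECORDING * RECORDINGS
binary_labels = [to_binary(c) for c in (1, 2, 3, 4, 5)]

print(samples_per_chunk, leftover, total_rows, binary_labels)
```

With this labelling, one class in five is positive, so the binary problem is imbalanced and accuracy alone is not a sufficient measure.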


3. METHOD 

Figure 2 shows the flow diagram of the proposed method. First, the benchmark dataset is selected
for classification, and the data is preprocessed with a filtering technique. The dataset is then
split into 70% training data and 30% test data. Finally, the SCLXGB classifier performs the
classification for seizure prediction. These steps are illustrated in Figure 2.


[Figure 2 flow: Seizure dataset from UCI repository -> Data preprocessing -> Train-test split -> Train set (70%) and Test set (30%) -> Seizure classification using XGBoost (SCLXGB)]

Figure 2. The flow diagram of the proposed model SCLXGB


3.1. Data preprocessing 

The dataset used in our study for epileptic seizure detection is publicly available and easily
downloadable [19]. The EEG recordings are multivariate time-series signals sampled at 173.61 Hz
with a spectral bandwidth of 0.5 Hz to 85 Hz. Preprocessing applies a low-pass filter with a
cutoff frequency of 40 Hz, and the data is then downsampled to suit the classifiers. Figure 1
shows the different dataset classes as voltage levels over time. The seizure class is labelled 1,
and the other classes are labelled 2, 3, 4, and 5, representing the different states of the
subjects. We concentrate only on the seizure class and the time duration of seizure occurrence.
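
As an illustration of the 40 Hz low-pass step, the following is a minimal sketch of a Hamming-windowed sinc FIR filter written with the standard library only. The text states the sampling rate and the cutoff but not the filter design, so the filter type, the number of taps (101), and all function names are assumptions rather than the filter used in this study.

```python
import math

FS_HZ = 173.61     # sampling rate stated in the text
CUTOFF_HZ = 40.0   # low-pass cutoff stated in the text


def lowpass_taps(num_taps: int, cutoff_hz: float, fs_hz: float) -> list:
    """Hamming-windowed sinc low-pass taps (odd num_taps), scaled to unit DC gain."""
    fc = cutoff_hz / fs_hz                  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal: list, taps: list) -> list:
    """Convolve, keeping only outputs where the taps fully overlap the signal."""
    k = len(taps)
    return [sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def rms(values: list) -> float:
    return math.sqrt(sum(v * v for v in values) / len(values))


def tone(freq_hz: float, n_samples: int, fs_hz: float = FS_HZ) -> list:
    """Unit-amplitude test sine; stands in for real EEG in this sketch."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n_samples)]


taps = lowpass_taps(101, CUTOFF_HZ, FS_HZ)
passed = rms(fir_filter(tone(10.0, 1024), taps))    # 10 Hz lies in the pass band
blocked = rms(fir_filter(tone(70.0, 1024), taps))   # 70 Hz lies in the stop band
print(f"10 Hz RMS after filtering: {passed:.3f}, 70 Hz RMS: {blocked:.5f}")
```

A 10 Hz test tone passes almost unchanged, while a 70 Hz tone is strongly attenuated. A filter of this length would be applied to the full recording before it is cut into 178-sample pieces.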


3.2. Extreme gradient boosting (SCLXGB) 


The XGBoost algorithm [16] incorporates a more regularized model to control overfitting and uses
parallel processing to build trees with a varying number of terminal nodes. Each tree comprises
an array of split points, with internal nodes representing input variables. The last node on each
path is a leaf, which gives the specific value of the output variable. Leaf weights that carry
little information about the occurrence of seizures are shrunk to reduce model complexity. The
XGBoost algorithm has the following significant improvements over other boosting algorithms:
— It is robust and speeds up the construction of the tree.
— It proposes a new distributed algorithm for tree searching and applies an extra randomization
parameter to decrease the correlation between trees.
— It provides a more regularized structure to manage overfitting and supports parallel
processing better than the gradient boosting model.
The novel contribution of this paper is that the parameters are optimized to obtain higher
accuracy with reduced model complexity. The extreme gradient boosting (XGBoost) algorithm [16],
[20] was implemented using the Scikit-learn [21] Python modules.
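
For reference, the regularization mentioned above can be written out using the notation of the XGBoost paper [16]. The symbols below are not defined elsewhere in this article: l is a differentiable loss, f_k is the k-th tree, T is its number of leaves, w is its vector of leaf weights, and G_j and H_j are the sums of first- and second-order loss gradients over the samples in leaf j.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\,\lambda \lVert w \rVert^{2},
\qquad
w_j^{*} = -\frac{G_j}{H_j + \lambda}
```

The term \gamma T penalizes the number of leaves, and \lambda shrinks each optimal leaf weight w_j^{*} toward zero, which is the shrinking of leaf weights described in this section.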


3.3. Model description 

Classifying EEG signals to detect an epileptic patient's seizure and non-seizure activity has
become an arduous task for medical practitioners [22]. The main idea behind this model is to
automate binary classification of the EEG signal using a state-of-the-art XGBoost learning model
(SCLXGB) that outperforms other seizure classification models. Extreme gradient boosting
(XGBoost) [23] is a breakthrough among ensemble learning models: it incorporates separate
variables directly without overfitting and can handle non-linear signals such as EEG. After data
preprocessing, the overall steps are as follows.
— The dataset is split into 70% training data and 30% test data. Out of 11500 samples, 8050 are
used for training and 3450 for testing.
— The test dataset is used to validate the SCLXGB model.
— The validated model is used for prediction.
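
The 70/30 split in the first step can be sketched as below. This is a minimal standard-library version; the function name and the fixed seed are assumptions, and the study may have used a library routine instead.

```python
import random


def train_test_split_indices(n_rows: int, test_fraction: float = 0.30, seed: int = 0):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(11500)
print(len(train_idx), len(test_idx))
```

For the 11500 rows in this dataset, this yields the 8050 training rows and 3450 test rows stated above.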


Algorithm 1: Proposed model SCLXGB
Input: UCI machine learning repository dataset
Output: Performance metrics and area under the curve (AUC)
1. Preprocess the data and generate the feature set.
2. Normalize the data using StandardScaler.
3. Model the task as a classification problem.
4. Perform the train-test split.
5. Train and tune the XGBoost model using all the features.
6. Calculate the classification metrics.
7. Plot the train and test confusion matrices and the ROC curve.
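
The normalization step of Algorithm 1 can be illustrated without any library. The sketch below mimics what a standard scaler does: it subtracts each feature's mean and divides by its population standard deviation. Fitting the statistics on the training rows only is an assumption made here to avoid information leaking from the test set; the text does not say which rows the scaler was fitted on. The function names and the toy data are hypothetical.

```python
import math


def fit_standardizer(rows: list):
    """Per-feature mean and population standard deviation (1.0 for constant features)."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds


def standardize(rows: list, means: list, stds: list) -> list:
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Toy data: two features, the second one constant in the training rows.
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
test = [[3.0, 12.0]]

means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)
print(train_z, test_z)
```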


Algorithm 1 lists the steps followed in the proposed model. The SCLXGB model is trained for
different numbers of epochs, and the corresponding train and test RMSE values are shown in
Figure 3. The stochastic behaviour of the algorithm is assessed by plotting the RMSE values for
the 70% training set and the 30% test set at each epoch. It is observed that the iterations should
be stopped after around 40 epochs to avoid overfitting the training data.
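
The stopping rule can be sketched as a small function over a list of per-epoch validation losses. The loss values below are synthetic and purely illustrative, and the patience of three rounds is an assumption; the text reports only that training should stop after around 40 epochs.

```python
def best_round(eval_losses: list, patience: int):
    """Return (best_index, stop_index) for a patience-based early-stopping rule.

    Training halts at the first round that is `patience` rounds past the best
    validation loss seen so far; if that never happens, it runs to the end.
    """
    best_idx = 0
    for i, loss in enumerate(eval_losses):
        if loss < eval_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(eval_losses) - 1


# Synthetic validation losses: improving at first, then rising as the model overfits.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.41]
best, stopped = best_round(losses, patience=3)
print(best, stopped)
```

Here the lowest loss occurs at index 3, and training stops at index 6 once three rounds have passed without improvement.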


[Figure 3 plot title: XGBoost RMSE]

Figure 3. The learning curve (logloss) of the proposed model SCLXGB

4. RESULTS AND DISCUSSION

The UCI machine learning repository dataset [19] for epileptic seizure detection is used to test
the proposed model. The results are compared with established models: KNN, logistic regression,
decision tree, and Gaussian Naïve Bayes classifiers [24]-[26]. Gradient boosted decision trees
(GBDT) is an ensemble learning method that combines many decision trees in series. XGBoost was
introduced by Chen and Guestrin [16] and has since been widely adopted by researchers because it
reduces processing time and uses memory efficiently. Table 1 shows the parameters used for the
classification models.


Table 1. Model parameters

Classifiers      Parameters
KNN              Neighbors(k)=5, weights=uniform, leaf_size=30, p=2, metric=minkowski
Decision tree    criterion='gini', splitter='best', min_samples_split=2, min_samples_leaf=1
SCLXGB           base_score=0.5, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
                 n_estimators=100, objective='binary:logistic', reg_alpha=0, reg_lambda=1
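
The SCLXGB row of Table 1 can be expressed as a parameter dictionary. The values are copied from the table. Passing them to `xgboost.XGBClassifier` is an assumption about how the model was built, based on the statement that a scikit-learn style interface was used; the helper name is hypothetical, and the import is deferred so that the dictionary can be inspected without the xgboost package installed.

```python
# Parameter values copied from the SCLXGB row of Table 1.
SCLXGB_PARAMS = {
    "base_score": 0.5,
    "gamma": 0,
    "learning_rate": 0.1,
    "max_delta_step": 0,
    "max_depth": 3,
    "n_estimators": 100,
    "objective": "binary:logistic",
    "reg_alpha": 0,
    "reg_lambda": 1,
}


def build_sclxgb(params: dict = SCLXGB_PARAMS):
    """Build a classifier with the Table 1 settings (requires the xgboost package)."""
    from xgboost import XGBClassifier  # imported lazily; assumed wrapper class
    return XGBClassifier(**params)


print(sorted(SCLXGB_PARAMS))
```

A typical use would be `build_sclxgb().fit(X_train, y_train)` with the standardized training rows and binary labels, where `X_train` and `y_train` are placeholders for data prepared as in the earlier steps.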


Table 1 lists the parameters set to optimize classifier performance. Our research focused on
model complexity and adjusted the following three parameters of the proposed SCLXGB model so that
it works efficiently, with less computation time and increased accuracy. The first parameter,
“max_depth”, controls model complexity: the higher the value, the more complex the model
becomes [27]. The second, “min_child_weight”, takes non-negative values. A node is split further
only if the sum of instance weights in each resulting child is at least this threshold, so the
algorithm becomes more conservative as the value increases. The third parameter is the “learning
rate”, which is tuned to reduce overfitting and make the system more robust. It shrinks the step
size, leaving more room for subsequent trees to improve the model. The classifier's performance
is evaluated with accuracy, precision, recall, F1-measure, and AUC. The performance metrics of
the various models are given in Table 2. A higher value indicates a better result.
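
For clarity, the threshold-based metrics named above can be computed from the four confusion-matrix counts as sketched below. The function names are hypothetical, and the counts in the example are made up for illustration; they are not the confusion matrices of this study.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the formulas.
example = binary_metrics(tp=90, fp=10, fn=30, tn=870)

# Consistency check using the rounded SCLXGB precision and recall from Table 2.
sclxgb_f1 = f1_from(0.892, 0.916)
print(example, round(sclxgb_f1, 3))
```

Applying `f1_from` to the SCLXGB precision (0.892) and recall (0.916) from Table 2 gives about 0.904, which agrees with the reported F1 score.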

Table 2 shows that KNN achieves a precision of 98.3%, better than any other model on this
benchmark dataset. The logistic regression and decision tree models do not perform remarkably in
this context. LR has an accuracy of 73.6%, which is very low, and DT has an accuracy of about
89.8%, which is lower than that of the proposed model. The proposed SCLXGB model outperforms the
other classifiers in recall, F1 measure, accuracy, and AUC. It achieves an accuracy of 96% with
less computation time. The system's performance is also evaluated in terms of computation time in
comparison with KNN and DT [28].

As shown in Figure 4, the computational time (elapsed training time) is 2.66 ms for the proposed
SCLXGB model, 7.87 ms for KNN, and 3.15 ms for DT. The confusion, precision, and recall matrices
are shown in Figure 5(a) for K-nearest neighbor, Figure 5(b) for linear regression, Figure 5(c)
for decision tree, and Figure 5(d) for the proposed SCLXGB classifier.
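
Elapsed times of this kind can be measured with a small wrapper around a monotonic clock. The helper name is hypothetical, and the timed call below is a stand-in workload, not a classifier from this study; the text does not describe how the reported times were measured.

```python
import time


def elapsed_ms(fn, *args, **kwargs):
    """Run fn once and return (result, wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


# Stand-in workload; in practice fn would be a classifier's fit method.
result, ms = elapsed_ms(sorted, range(100_000, 0, -1))
print(f"elapsed: {ms:.2f} ms")
```

A single run is noisy at the millisecond scale, so repeating the measurement and reporting a median would give a more stable comparison between classifiers.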


Table 2. Performance metrics of test data

Model               Accuracy   Precision   Recall   F1 Score   AUC
KNN                 0.931      0.983       0.678    0.803      0.8378
LR                  0.736      0.377       0.438    0.406      0.6259
DT                  0.898      0.708       0.860    0.777      0.8844
GNB                 0.958      0.907       0.894    0.899      0.9349
SCLXGB (Proposed)   0.960      0.892       0.916    0.904      0.9462

Figure 4. Computational time of the classifiers

Figure 5. Confusion matrix: (a) KNN, (b) LR, (c) DT, and (d) SCLXGB

The AUC of the various classification models is depicted in Figure 6. A higher AUC value reflects
a better model. The proposed model exhibits better performance than the other models.

Figure 6. AUC of classification models on test data
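
The AUC can be read as the probability that a randomly chosen seizure sample receives a higher score than a randomly chosen non-seizure sample. The sketch below computes it directly from that definition by comparing every positive-negative pair. The function name is hypothetical and the labels and scores are made up; this is not necessarily how the reported values were computed.

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = seizure) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise loop costs time proportional to the number of positives times the number of negatives, which is acceptable for a test set of this size but slower than a rank-based formulation.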


5. CONCLUSION 


In the present study, different ML classifiers are compared for identifying the ictal period of
seizures in epileptic patients. The investigation shows that the proposed binary seizure
classifier outperforms the other models in terms of regularization and computational time.
Seizure classification using the SCLXGB model is a novel approach and provides the best AUC of
0.9462 and a test accuracy of 96% compared with other machine learning models such as KNN, linear
regression, decision tree, and Naïve Bayes. After pre-processing, the data is split into 70%
training data and 30% test data for validation of the model. Compared with the KNN, LR, decision
tree, and Naïve Bayes models, SCLXGB is the most suitable classification model. The proposed model
provides suitable regularization that prevents overfitting. SCLXGB uses a sparsity-aware
split-finding algorithm to handle different types of data sparsity patterns. The data is organized
in in-memory units called blocks so that it can be reused across iterations instead of being
recalculated. This helps to reduce the computation time to 2.66 ms and makes the model more robust
for classification. In SCLXGB, fetching gradient statistics by row index requires non-contiguous
memory access; for optimum hardware utilization, an internal buffer is assigned to each thread in
which the gradient statistics can be stored. This feature enables optimized use of available disk
space and the handling of large datasets.